File: JPAB 



Jul 19, 2001 



DOCUMENT-IDENTIFIER: JP 2001194983 A

TITLE: INDOOR HEAT DISTRIBUTION SIMULATOR AND RECORDING MEDIUM 
Abstract Text (1) : 

PROBLEM TO BE SOLVED: To provide a simulator which enables a person planning to 
purchase a house or installing a heating apparatus to sufficiently learn and study 
the heat distribution in a room warmed by the various heating apparatus, and then 
to visually recognize the heat distribution so as to be made able to purchase the 
heating apparatus. 

Abstract Text (2) : 

SOLUTION: The heat distribution simulator has a first display means for displaying 
the state of a prescribed room inside 41 where the heating apparatus is arranged on 
a screen by three-dimensional CG images and a second display means for fetching the
image data of the isothermal curved surfaces in the room when the heating apparatus 
Jaeyoung Yang, Joongmin Choi, Heekuck Oh

parts without the help of semantics-based modules. Another difficulty of
automatic wrapper generation comes from the heterogeneity of stores, in the
sense that different stores employ different mechanisms for handling customer
queries and different styles of displaying product descriptions.

Wrapper induction [7] has been suggested as a promising strategy to deal
with these heterogeneous resources. Wrapper induction creates a wrapper by
reasoning about a sample of the resource's pages. Scalable wrapper induction
systems such as ShopBot[5] have been suggested that automatically build the
wrapper through learning. However, most previous systems are limited in
the sense that they exclude the cases where product descriptions contain noise
such as missing attributes or extra attributes, and they also assume strong
biases that must be satisfied for the learning algorithm to work. As a result,
many real-world stores that do not conform to these restrictions cannot be
recognized.

In this paper, we propose a more scalable comparison-shopping agent named
MORPHEUS, which presents a simple but robust inductive learning algorithm
that automatically constructs wrappers. It learns not only how to extract a
store's content, but also how to query the store. During its learning phase,
MORPHEUS receives the URL of a store as an input from the user, and generates
the query template and the pattern of a product description unit (PDU)
as a learned result. The main idea of extracting the correct PDU pattern is to
recognize the position and the structure of a PDU by finding the most frequent
pattern in the sequence of logical line information in the output HTML pages.
This pattern is regarded as an extraction rule of the wrapper.

The characteristics of MORPHEUS can be described in three ways. First,
it successfully constructs correct wrappers for most stores without assuming
many strong biases and structural constraints. Second, MORPHEUS tolerates
some noise that might be present in the product descriptions. In particular,
MORPHEUS focuses on handling the cases where the product descriptions
have extra attributes or missing attributes. Third, learning in MORPHEUS is
fast since it does not include a pre-processing stage that removes
redundant fragments such as heads, tails, and advertisements.

Our eventual goal is to build an environment in which a personalized
comparison-shopping agent can be created easily and rapidly. In this
environment, different users will have their own comparison shoppers with
distinct lists of user-specified stores. To realize this idea of a personalized
agent, an efficient, reliable, and flexible wrapper induction system is
required, and we claim that MORPHEUS accomplishes this job with satisfactory
results.

This paper is organized as follows. Section 2 describes the overall architecture
of MORPHEUS. Section 3 concerns wrapper learning for online stores;
algorithms for analyzing input forms and identifying product description units
are suggested. Section 4 gives the empirical results, and Section 5 compares
MORPHEUS with previous wrapper induction systems. Section 6 concludes
with a brief summary and a description of future work.
can be facilitated, and currently the options such as display by products, by
prices, and by stores are available in MORPHEUS.
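
As a rough illustration of what a unified output might look like, the sketch
below merges per-store records that carry different attribute sets into rows
with one shared column set, and then sorts them by product, price, or store.
The function names, the record layout, and the "Title"/"Price" keys are
assumptions made for this example; the paper does not describe the output
generator at this level of detail.

```python
def unify(results_by_store):
    """Merge per-store records into rows that share one column set.

    results_by_store maps a store name to a list of attribute dicts.
    Attributes missing from a record are filled with an empty string.
    """
    columns = ["Store"]
    for records in results_by_store.values():
        for record in records:
            for key in record:
                if key not in columns:
                    columns.append(key)

    rows = []
    for store, records in results_by_store.items():
        for record in records:
            row = {"Store": store, **record}
            rows.append({c: row.get(c, "") for c in columns})
    return columns, rows


# Hypothetical display options mirroring "by products, by prices, by stores".
SORT_KEYS = {
    "product": lambda row: row["Title"],
    "price": lambda row: row["Price"],
    "store": lambda row: row["Store"],
}


def sort_rows(rows, by):
    """Return the rows ordered by one of the display options above."""
    return sorted(rows, key=SORT_KEYS[by])


# Made-up sample data: two stores with different attribute sets.
sample = {
    "store-a": [{"Title": "Widget", "Manufacturer": "Acme", "Price": 12.5}],
    "store-b": [{"Title": "Widget", "Size": "M", "Price": 11.0}],
}
columns, rows = unify(sample)
cheapest_first = sort_rows(rows, "price")
```

Taking the union of attributes keeps every store's information visible; a
real system might instead restrict the display to attributes common to all
stores.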

The user profile contains user-related information that is either provided
directly by the user, or implicitly acquired by tracing the user's actions. This
profile is managed to provide better service to an individual user, such as
recommending new products that the user might be interested in and suggesting
purchasing patterns of other people who have similar tastes.

Since we are more interested in wrapper learning, we focus on the wrapper 
generator module hereafter, although we implemented all of the above-mentioned 
modules. 

3 Learning Wrappers for Online Stores
3.1 Analyzing Input Forms 

The wrapper generator learns the query scheme by analyzing the input forms
in the HTML source of the page containing searchable input boxes. Most stores
use HTML's <FORM> structure for implementing searchable input boxes, and the
input box in which the actual keyword is typed is realized by the input tag with
the TEXT type attribute. Hence, the wrapper generator first checks if there is any
<FORM> structure in the HTML page that contains an <input type=TEXT ...>
tag inside. Once the correct input form is identified, a query template can be
easily generated by extracting the value of the action attribute in the <FORM>
structure.
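
The form analysis described above can be sketched with Python's standard
html.parser module. This is only a minimal illustration under our own
assumptions: the class name FormFinder, the dictionary layout, and the sample
page are hypothetical and are not taken from the MORPHEUS implementation.

```python
from html.parser import HTMLParser


class FormFinder(HTMLParser):
    """Collect every <form> that contains at least one text input."""

    def __init__(self):
        super().__init__()
        self.forms = []        # finished forms that have a text input
        self._current = None   # form currently being parsed, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)    # tag and attribute names arrive lowercased
        if tag == "form":
            self._current = {
                "action": attrs.get("action") or "",
                "method": (attrs.get("method") or "get").lower(),
                "text_inputs": [],
                "other_fields": {},
            }
        elif tag == "input" and self._current is not None:
            name = attrs.get("name")
            if name is None:
                return
            kind = (attrs.get("type") or "text").lower()
            if kind == "text":
                self._current["text_inputs"].append(name)
            else:
                self._current["other_fields"][name] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form" and self._current is not None:
            if self._current["text_inputs"]:
                self.forms.append(self._current)
            self._current = None


def find_search_forms(html):
    """Return the candidate search forms found in an HTML page."""
    finder = FormFinder()
    finder.feed(html)
    return finder.forms


# Hypothetical page: one searchable form and one form without a text box.
SAMPLE_PAGE = """
<FORM method="POST" action="/search.cgi">
  <input type="hidden" name="index" value="books">
  <input type="TEXT" name="keywords">
  <input type="submit" name="Go" value="Go">
</FORM>
<form action="/subscribe"><input type="checkbox" name="news"></form>
"""
search_forms = find_search_forms(SAMPLE_PAGE)
```

The action and method of each returned form, together with its fixed fields,
are the raw material from which a query template can be assembled.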

A problem occurs when there are several <input type=TEXT ...> tags in a
page. To select the best input tag in this situation, MORPHEUS evaluates each
tag by feeding in a sample query and comparing the actual results. One decision
heuristic we adopt is that the input tag that results in the greatest
number of correct product descriptions is selected. Another heuristic that
we plan to pursue is the use of labels that might accompany the input
tag, such as the label "Book Title". In this case, the domain knowledge (or the
ontology) has to play an important role since the meaning of a label can only be
interpreted by referring to the ontology.
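
The first heuristic amounts to scoring every candidate input tag and keeping
the best one. The sketch below assumes a caller-supplied function,
count_correct_pdus, that submits the sample query through a given input tag
and returns the number of correct product descriptions found; both that
function and the candidate names are hypothetical stand-ins.

```python
def choose_input(candidates, count_correct_pdus):
    """Return (best_candidate, scores) for a non-empty list of input names.

    count_correct_pdus(name) is assumed to run the sample query through the
    input tag called `name` and count the correct product descriptions.
    """
    if not candidates:
        raise ValueError("no candidate input tags")
    scores = {name: count_correct_pdus(name) for name in candidates}
    # On ties, max() keeps the candidate that appears first in the list.
    best = max(candidates, key=lambda name: scores[name])
    return best, scores


# Made-up scores standing in for real query results.
fake_counts = {"email": 0, "field-keywords": 12, "zip": 0}
best_input, input_scores = choose_input(list(fake_counts), fake_counts.get)
```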

Fig. 2 shows the actual query template that is automatically extracted from
the Amazon bookstore. The template specifies the CGI method executed (in this
case, POST), the URL of the CGI routine, and some options about input keywords.
This template is used in the wrapper interpreter by replacing the
INPUT-TEXT string in the template with the actual keywords the user typed in.



-post http://www.amazon.com/exec/obido

2540219 -form "index=books" "field-keywords=INPUT-TEXT" "Go=Go"



Fig. 2. The actual query template for Amazon
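
To make the use of such a template concrete, the following sketch fills a
template with the user's keywords and URL-encodes the form fields. The
dictionary layout, the example URL, and the exact spelling of the placeholder
are assumptions for illustration only; just the three form fields are taken
from the figure.

```python
from urllib.parse import urlencode

# Assumed spelling of the placeholder that marks the keyword slot.
PLACEHOLDER = "INPUT-TEXT"


def build_query(template, keywords):
    """Return (method, url, encoded_fields) for one store's template."""
    fields = {
        name: keywords if value == PLACEHOLDER else value
        for name, value in template["fields"].items()
    }
    return template["method"], template["url"], urlencode(fields)


# Hypothetical template; the URL is a stand-in, not the one in the figure.
example_template = {
    "method": "post",
    "url": "http://store.example/search.cgi",
    "fields": {"index": "books", "field-keywords": PLACEHOLDER, "Go": "Go"},
}
method, url, body = build_query(example_template, "machine learning")
```

A wrapper interpreter would call a function like this once per store and then
send each resulting request to the corresponding site.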
2 Overview of MORPHEUS
The overall architecture of MORPHEUS is shown in Fig. 1. It consists of several
modules including the wrapper generator, the wrapper interpreter, the output
generator, and the user profile.




Fig. 1. The overall architecture of MORPHEUS



The wrapper generator is the main learning module that constructs a wrapper
for each store. For automaticity, the wrapper generator learns two things. First,
it learns how to query a particular store by recognizing its query scheme. An
HTML page containing a searchable input box is analyzed and a query template
is generated. Second, it learns how to extract a store's content. Here, PDUs are
recognized and the structure of the representative PDU of the store is determined
by finding the most frequent pattern in the sequence of logical line information
in the output HTML pages.

The wrapper interpreter is a module that executes learned wrappers to get
the product search results. After getting request keywords from the user, this
module builds several actual queries by combining each store's query template
with the keywords, and sends them to the corresponding shopping sites. The
search results from the stores are then collected and fed to the output generator
module.

The output generator integrates search results from several on-line stores
and forms a unified output. Different stores employ different display formats. For
example, while some stores may produce a result page in which each product has
three attributes such as Title, Manufacturer, and Price, other stores may have a
result with different attributes such as Title, Size, and Price. Uniform display of
results is necessary for readability and easy comparison. Various display options
3.2 Types of Search Results

Besides learning input forms, the wrapper generator also has to learn the
structure of PDUs from the search result page. This is the most important topic
of this paper. There are diverse forms of displaying product search results, but
through a series of examinations of numerous output pages, we decided to
classify the output display formats into two categories.

The first category is the table-type form, which displays search results in a
tabular form. This is a fully structured format. Normally, the first row of the
table represents the header information of the table that tells the meaning of
each column by specifying labels, so the product description can be analyzed by
interpreting the meaning of the labels in the header. In this case, the existence
of domain knowledge is crucial to identify the meaning of each label.

The second category is what we call the list type, which is a semi-structured
format. Similar to the table type, the list-type output consists of a list of
product descriptions, and each PDU has a number of components. However,
unlike the tabular form, there are no clear-cut boundaries for separating one
product description from others, or for separating one attribute of a
product from other attributes of the same product. The usual way of delimiting
these components is by using heuristics such as detecting new lines. An
example of a list-type result is shown in Fig. 4 in Section 3.3. In fact, we can say
that all the result pages that do not employ tabular forms belong to this case.
Note also that the analysis of list-type PDUs does not depend much upon the
existence of domain knowledge.

We originally implemented two separate extraction algorithms to deal with
the two types of search results. Naturally, these algorithms assumed that
the output pages can somehow be correctly identified as either type. We found,
however, that determining the type of the current page is itself a formidable
task, because there can be many nested levels of tables, among other things.
Furthermore, our goal is to minimize the dependency on the domain knowledge,
but the algorithm for the table-type display necessarily has to refer to the domain
knowledge to interpret the meaning of header labels. From these observations,
we decided to focus on the list-type display only, and developed an algorithm
for extracting PDU patterns from this type of output page. Surprisingly, it
turned out that this algorithm can also be applied to the table-type search
results without any code modification and with the same error rate. Thus, we
only describe the extraction algorithm for the list-type search results.

3.3 Identifying Product Description Units 

In this section, we describe the algorithm that extracts PDU patterns from
HTML-based output pages with the list-type search results. In general, a
list-type page will have the form depicted in Fig. 3. In this figure, a and are the
header and the tail, respectively. These are redundant and irrelevant fragments
of the page (e.g., advertisements) that must be ignored. e denotes a PDU that
must be extracted, and consists of a set of attributes that describe the product.
Fig. 4. A search result from the Amazon bookstore
Fig. 5. An HTML source for a PDU and the categories for logical lines



example, n=3). We also employ a simple selection criterion: a
candidate PDU pattern that occurs more than m/n times in a page, where m is
the total number of PDUs in the page, becomes the representative PDU pattern.
In this example, the third candidate, Fig. 4(c), is selected as the representative
PDU pattern since its pattern occurs 37 times out of 51 total PDUs (i.e., its
frequency exceeds 33% of the total number of PDUs) in the page. (This result is
shown in Fig. 6 in the next section.) Once the representative PDU pattern is
determined, MORPHEUS modifies other noisy PDUs by ignoring extra attributes
or putting dummy values for missing attributes.
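
The selection criterion and the repair of noisy PDUs can be sketched as
follows. The three pattern strings and the totals (37 of 51) come from the
text; the split of the remaining 14 PDUs between the other two patterns, the
attribute names, and the dictionary-based repair are assumptions made purely
for illustration.

```python
from collections import Counter


def representative_pattern(pdu_patterns):
    """Return the most frequent pattern if it occurs more than m/n times.

    m is the total number of PDUs on the page and n is the number of
    distinct patterns; None is returned when no pattern passes the test.
    """
    if not pdu_patterns:
        return None
    counts = Counter(pdu_patterns)
    m = len(pdu_patterns)
    n = len(counts)
    pattern, freq = counts.most_common(1)[0]
    return pattern if freq > m / n else None


def normalize(pdu, attributes, dummy=""):
    """Drop extra attributes and add dummy values for missing ones."""
    return {name: pdu.get(name, dummy) for name in attributes}


# 37 PDUs follow pattern (c); the 8/6 split of the rest is hypothetical.
page = ["029515"] * 8 + ["02951595"] * 6 + ["0295159595"] * 37
chosen = representative_pattern(page)   # threshold is 51 / 3 = 17

# Hypothetical noisy PDU: it has an extra "rating" and lacks "author".
fixed = normalize({"title": "T", "price": "$1", "rating": "5"},
                  ["title", "author", "price"])
```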

4 Experimental Results 

We implemented MORPHEUS and built a Web interface so that the user is
able to specify the URL of a shopping mall that needs to be learned. The
web site for the test version of MORPHEUS is maintained at the URL
http://infoagent.hanyang.ac.kr/~hkseo/demo/index.cgi.

To evaluate the success rate of constructing correct wrappers, we have tested
MORPHEUS on a total of 62 real online stores, 29 of which display search
results in tabular forms, and the other 33 sites in list forms. Keep in mind that
the extraction algorithm for the list-type display is applied to both output types.
Each attribute is denoted by fx. In the case of the books domain, p can be the title,
the author, the publisher, or the price. Therefore, the objective of the learning
algorithm will be to correctly identify PDUs, and to determine the representative
PDU e.




Fig. 3. The general form of list-type results



e can be obtained through the following steps. 

- Step 1: The HTML source of the result page for a sample query is saved into
a file.

- Step 2: The page is broken down into logical lines. The algorithm
recognizes each logical line by examining delimiters such as the <br> tag.
Hence, a logical line is conceptually similar to a line that the user sees in the
browser.

- Step 3: Each logical line is categorized by identifying its meaning. Currently,
we classify each line into one of 4 categories, and each category is given a
number. That is, 0 denotes the product name, 1 denotes the price, 5 denotes
the line breaking tag such as <br>, and 9 is assigned to a line that is not
recognizable as one of the above three. The product name can be recognized
if a line contains one of the keywords in the sample query. The price can be
recognized by finding a digit and some other symbols such as $. The number
of recognizable categories is currently limited because we are exploiting very
little domain knowledge and focusing on extracting the price information.
Expanding the categories to other non-price attributes is in progress.

- Step 4: After the third step, the entire page is expressed by a sequence of
numbers. The algorithm then finds a repeating pattern in this sequence. The
most frequent pattern becomes the representative PDU e.

As an example, consider Fig. 4, which shows a part of a search result from
the Amazon bookstore. Fig. 5 presents the HTML source for Fig. 4(c) with a
category number for each logical line.
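
Steps 2 and 3 can be sketched in a few lines. This is a simplified
illustration, not the authors' code: it treats only the <br> tag as a
delimiter, uses a plain regular expression in place of the price grammar, and
emits only the four categories defined in Step 3. The function name and the
sample HTML are hypothetical.

```python
import re

BREAK = re.compile(r"(<br\s*/?>)", re.IGNORECASE)   # logical-line delimiter
TAG = re.compile(r"<[^>]+>")                         # any other HTML tag
PRICE = re.compile(r"\$\s*\d")                       # "$" followed by a digit


def categorize(html, query_keywords):
    """Express a result page as a string of category digits."""
    codes = []
    for piece in BREAK.split(html):
        if BREAK.fullmatch(piece):
            codes.append("5")        # line-breaking tag
            continue
        text = TAG.sub(" ", piece).strip()
        if not text:
            continue                 # nothing visible in this segment
        lowered = text.lower()
        if any(k.lower() in lowered for k in query_keywords):
            codes.append("0")        # product name: contains a query keyword
        elif PRICE.search(text):
            codes.append("1")        # price
        else:
            codes.append("9")        # unrecognized line
    return "".join(codes)


# Made-up PDU in the spirit of a list-type result.
SAMPLE_PDU = "<b>Java Programming</b><br>by Some Author<br>Our Price: $29.95<br>"
sequence = categorize(SAMPLE_PDU, ["java"])
```

Running the same function over a whole result page yields the number sequence
in which Step 4 looks for the most frequent repeating pattern.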

The main advantage of the proposed algorithm is the ability to adapt dynamically
to noisy situations in which some PDUs have extra attributes or missing
attributes. Consider again the result page in Fig. 4, where three different kinds
of PDUs exist in a page. Each candidate PDU has a distinct pattern, that is,
Fig. 4(a) has a pattern of "029515", Fig. 4(b) has "02951595", and Fig. 4(c) has
"0295159595". We assume that each pattern occurs in a page with the uniform
probability of 1/n, where n is the number of distinct patterns in a page (in this
We also assumed that a test query is properly given so that an output page with
reasonably many matched products is produced (that is, we exclude the outputs
with no matched products).

Fig. 6 shows a list of some of the 62 sites that have been tested. During the
tests, we also collected some relevant data such as the total number of logical
lines, the total number of PDUs, and the number of correct PDUs, all in a page.
The data in the table is for a single test query only. Fig. 7 shows the result
of this experiment in terms of the success rate of wrapper generation. As shown
there, the proposed wrapper generation algorithm works perfectly for the
table-type results (29/29), and it also works satisfactorily for the list-type
results, succeeding for 29 out of 33 stores.



Store URL            No. of LLs   No. of total PDUs   No. of correct PDUs   Output type
ypbs.ypbooks.co.kr   611          10                  10                    List
www.csclub.com       2000         20                  20                    Table
www.amazon.com       1119         51                  37                    List
store.yahoo.com      317          7                   5                     List
www.buy.com          1405         3                   2                     List

Fig. 6. Experiment data during wrapper generation for 62 sites



Output type   Avg. no. of LLs   Avg. no. of total PDUs   Avg. no. of correct PDUs   Success rate
List-type     562.5             29.67                    28.60                      87.8% (29/33)
Table-type    786.37            9.75                     9.06                       100% (29/29)

Fig. 7. The success rate of constructing wrappers for 62 stores



Our algorithm works for most shopping malls, although there are a few sites
that we are unable to cover for the following reasons. First, the meaning of an
attribute of a PDU might not be recognized due to irregularity. For example,
we have defined a BNF grammar for identifying the price attribute, but if there
is an extra HTML tag inside the price expression, MORPHEUS will be confused.
Second, there is a chance of extracting a wrong pattern when the length of a
pattern is too short, say just 1 logical line. In this case, the pattern becomes
too general to be used to extract useful repeating information. Based on our
experiment, a PDU should be at least 2 lines long to be correctly
analyzed. Third, there are cases where a single logical line contains more than
one attribute value. In this "multi-attributes in a single line" case, MORPHEUS
ent font styles, and sequence of alphanumeric characters. These heuristics are
obtained mainly from the users rather than through learning, so the power of
automatic wrapper learning in ARIADNE is limited.

6 Conclusion and Future Work 

We have proposed a customized comparison shopping agent, MORPHEUS. It
successfully constructs correct wrappers for most stores without assuming many
strong biases and structural constraints. It also tolerates some noise that might
be present in the product descriptions. Ultimately, MORPHEUS is an attempt
to build an environment in which a customized comparison-shopping
agent can be created easily. To realize this, an efficient, reliable, and flexible
wrapper induction algorithm is designed and implemented, and its results are
satisfactory.

This technique can also be applied to other information integration systems
for heterogeneous information sources such as personalized meta-search engines.
We are planning to enhance the current system further, and build a full-fledged
personalized shopping agent by employing the user profiles and adopting a
collaborative filtering scheme.
is unable to integrate product attributes. The current version of MORPHEUS
does not check whether a single line contains multi-attributes.

5 Related Work 

Several researchers have investigated the topic of information extraction
from semi-structured documents. In some systems such as TSIMMIS[6] and
ARANEUS[2], extraction rules for the wrapper are written by humans.
BargainFinder[3], known as the first prototype of a comparison shopper, also builds
the wrappers manually. To our knowledge, the wrappers in today's commercial
comparison shoppers, including MySimon[9], PriceWatch[10], and BottomDollar[4],
are all hand-coded, and consequently they do not scale well to
new stores and format-changed stores.

Automatic wrapper generation systems are suggested to cope with these
problems. These systems build wrappers through inductive learning, and heavily
rely on the regularities in the structure of the documents.

MORPHEUS is closely related to ShopBot[5]. Both apply wrapper induction
to the shopping domain, and learn an online vendor's query scheme as well as its
content. ShopBot finds patterns in the HTML source of the example document
by partitioning the pages into a sequence of logical records, generating signatures
for each record by removing non-HTML characters, and ranking the signatures
by the fraction of the pages each accounts for. In contrast, MORPHEUS breaks
each result page into logical lines (LLs) by examining visually salient HTML
delimiters such as the <br> tag. The main difference is that, in MORPHEUS,
LLs are processed as is without being transformed into signatures, and each LL
is assigned a category which tells what the line means. The page is represented
by a sequence of LL categories, and eventually, the most frequent subsequence
of categories is determined as the representative PDU pattern. ShopBot relies
on a strong bias that a product description must reside on a single line.
MORPHEUS does not assume such a bias, and moreover it tolerates noise in
the pattern of LLs, which is not possible in ShopBot.

Kushmerick has worked extensively on wrapper induction for general
semi-structured documents, not necessarily concentrating on specific domains. He
suggests several wrapper classes including HLRT[7] that can be applied to HTML or
other non-HTML documents. Kushmerick's wrapper techniques have the advantage of
dramatically reducing the time and the effort required to produce a wrapper for
information sources. One limitation of his approach is that the wrapper classes are
simple and cannot handle noisy cases such as missing attributes. As
mentioned above, MORPHEUS can handle these noisy situations.

The STALKER[8] algorithm deals with missing items or out-of-order items.
However, STALKER's learning is not fully automatic in the sense that users
need to be involved in the preparation of training examples.

ARIADNE[1] is a semi-automatic wrapper generation system, which is mainly 
targeted at hierarchically structured documents. ARIADNE uses a fixed set of 
heuristics about the formatting information such as relative font size, differ- 